Add new nvtext tokenized minhash API #17944

Draft
wants to merge 20 commits into base: branch-25.04

@davidwendt commented Feb 7, 2025

Description

Adds a new minhash API that operates on ngrams of the row elements of a lists column of strings.

std::unique_ptr<cudf::column> minhash_ngrams(
  cudf::lists_column_view const& input,
  cudf::size_type ngrams,
  uint32_t seed,
  cudf::device_span<uint32_t const> parameter_a,
  cudf::device_span<uint32_t const> parameter_b,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

The input column is expected to contain rows of words (strings). Each row is hashed using a sliding window of words (ngrams), and the existing permuted minhash algorithm is then reused to produce the minhash values.
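
The flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the libcudf implementation: `zlib.crc32` stands in for the real hash function, each window is hashed as the concatenation of its words, and the permutation step is assumed to be `(a * h + b) mod p` with `p = 2^61 - 1`, truncated to 32 bits. The helper names (`word_ngrams`, `base_hash`, `minhash_ngrams_sketch`) are hypothetical.

```python
import zlib

# Assumptions for illustration only: the modulus and the stand-in hash below
# are not taken from this PR.
MERSENNE_PRIME = (1 << 61) - 1
HASH_MAX = 0xFFFFFFFF  # keep results in the uint32 range


def word_ngrams(words, ngrams):
    """Return every sliding window of `ngrams` consecutive words."""
    return [words[i:i + ngrams] for i in range(len(words) - ngrams + 1)]


def base_hash(window, seed):
    """Stand-in 32-bit hash of one window (NOT the hash libcudf uses)."""
    data = "".join(window).encode("utf-8")
    return zlib.crc32(data, seed) & HASH_MAX


def minhash_ngrams_sketch(rows, ngrams, seed, a, b):
    """For each row, return one minhash value per (a[k], b[k]) pair."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    results = []
    for words in rows:
        windows = word_ngrams(words, ngrams)
        if not windows:
            # This sketch does not define behavior for short rows.
            raise ValueError("row has fewer words than ngrams")
        hashes = [base_hash(w, seed) for w in windows]
        # Permute every window hash with each (a, b) pair and keep the minimum.
        row_result = [
            min(((pa * h + pb) % MERSENNE_PRIME) & HASH_MAX for h in hashes)
            for pa, pb in zip(a, b)
        ]
        results.append(row_result)
    return results
```

Because the hash is a stand-in, the values this sketch produces will not match libcudf output; only the shape (one list per row, one value per parameter pair) is meant to correspond.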

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the labels 2 - In Progress, libcudf, strings, improvement, and non-breaking on Feb 7, 2025
@davidwendt self-assigned this on Feb 7, 2025

copy-pr-bot bot commented Feb 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions bot added the labels Python and pylibcudf on Feb 11, 2025
@davidwendt added the label 3 - Ready for Review and removed the label 2 - In Progress on Feb 11, 2025
@davidwendt
Contributor Author

/ok to test

@davidwendt
Contributor Author

Python example usage:

import cudf
import numpy as np

params = cudf.Series([1, 2, 3], dtype=np.uint32)
strings = cudf.Series([["this", "is", "my"], ["favorite", "book", "today"]])
results = strings.str.minhash_ngrams(ngrams=2, seed=0, a=params, b=params)
print(results)

Output:

0      [416367548, 832735096, 1249102644]
1    [1408797893, 2817595786, 4226393679]
dtype: list
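
To make the shape of that result easier to read, the following small sketch lists the word windows each input row yields with `ngrams=2`. It uses plain Python lists rather than cuDF, and the helper name `windows` is hypothetical.

```python
rows = [["this", "is", "my"], ["favorite", "book", "today"]]
ngrams = 2


def windows(words, n):
    """Return the sliding windows of n consecutive words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


for row in rows:
    print(windows(row, ngrams))
# [('this', 'is'), ('is', 'my')]
# [('favorite', 'book'), ('book', 'today')]
```

Each row yields two windows here, and the result above contains one minhash value per parameter pair, which is why each output row holds three values for the three entries in `params`.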
